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Abstract 



The adaptive multi-channel method is applied to derive probability distributions 
from data samples. Moreover, an explicit algorithm is introduced, for which both 
CN I the channel weights and the channels themselves are adaptive, and which can be 

used both for data analysis and for importance sampling in Monte Carlo integration. 
^T) . Finally, it is pointed out how the usefulness for data analysis can be used to optimize 

O I the integration procedure. 
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^ : 1 Introduction 

^ ; Since its introduction in [T], adaptive multi-channeling has been extensively used as a help for 
^ I importance sampling in numerical integration. The probability density, following which the in- 
tegration points are generated, is written as a sum of densities with positive weights, and these 
weights are optimized during the integration process, in order to minimize the variance. The den- 
sities in the sum are called channels. In this paper, two variations of this method are introduced. 

Firstly, we observe the similarity between the optimization of the density following which the 
integration points are generated, and the creation of a histogram in data-analysis. A histogram 
is a weighted sum of densities, given by the normalized non-overlapping indicator functions of 
subsets of the space in which the data-points take their values. These indicator functions are the 
'bins' . The optimal weights are estimated by the number of data points in the bins, and these are 
exactly the maximum likelihood estimators. In this paper, we will see that a histogram is a special 
case of the result of the application of a multi-channel method to derive the probability distribu- 
tion. This will lead to 'generalized histograms' or unitary probability decompositions (u.p.d.s), 
by the use of general probability densities instead of the indicator functions. In Sectional it will 
be pointed out how the weights can be optimized. 
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Secondly, we note that the adaptation in the application of the multi-channel method to im- 
portance sampling has mainly been applied to the weights in the sum of weighted densities. The 
channels themselves are fixed, which presupposes some knowledge about the integrand based 
upon which the particular channels have been chosen. If one does not want to rely on such knowl- 
edge to much, one can try to adapt also the channels. The simplest way to do this is by discarding 
channels with low weight during the integration process, or choosing new sets of channels from 
a larger given pool of channels 0. A more advanced way of channel adaptation would be in the 
spirit of the VEGAS -algorithm or the FOAM-algorithm |@|, by creating new channels based 
on statistical analyses of the existing channels. The disadvantage of the first algorithm is that 
it is of limited efficiency in more than one dimension if the integrand has non-factorizable peak 
structures. The disadvantage of the second algorithm is that its complexity increases factorially 
with the dimension of the integration space. The problem with VEGAS can be solved if some 
information about the integrand is available, through the use of various channels corresponding 
with different coordinate systems Q. 

In Section|3l the algorithm PARNI will be introduced. It uses adaptive multi-channeling, with 
channels that are fully adaptive themselves. It has no a priori problems with non-factorizable 
peak structures in the integrand, and its complexity grows linearly with the dimension of the 
integration space. Except for automatic importance sampling in Monte Carlo integration, PARNI 
can also be used for the creation of a u.p.d.. In Section |4l we will see how this property can be 
used to optimize the integration process, and this strategy will be applied to a problem in phase 
space integration. 

2 Multi-channeling for data analysis 

The derivation of a probability distribution from a sample of data is a common problem in sci- 
entific research. The idea is that such a distribution exists a priori, and that the data are drawn at 
random. The Bayesian interpretation then tells us that the a priori distribution (a.p.d.) gives the 
probability for this particular sample of data to be drawn. The frequentist interpretation tells us 
that estimators, calculated with the sample, will converge to the same values as the ones from the 
a.p.d. if the sample becomes very large. This interpretation can be translated into the statement 
that the distribution of the sample converges to the a.p.d. if the sample becomes large. 

An obvious way to derive this distribution is by using such estimators, and the assumption 
that the distribution belongs to a class that can be completely described by the parameters that are 
estimated. In most cases, the original problem is even stated such, that the distribution is of a class 
that can be completely described by a set of parameters, and that one only wants to determine 
the correct values of these parameters. Part of the problem then becomes the determination of 
the right estimators, for example the maximum likelihood. 

One particular kind of estimators are the number of data points in subsets of the space in 
which the data points take their values. Such a number, divided by the total number of data 
points, estimates the integral of the a.p.d. over the subset. The estimates for a collection of non- 
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overlapping subsets, or bins, that cover the whole space constitute a histogram, and in certain 
limits in which the number of bins and the number of data go to infinity, one can speak about 
convergence of the histogram to the probability density of the a.p.d.. 

A histogram is a weighted sum of densities, given by the normalized indicator functions of 
the subsets. The estimators described above are exactly the maximum likelihood estimators of 
these weights. In this section, we will see that a histogram is a special case of the result of 
the application of the multi-channel method to derive the probability distribution, and we will 
see how to generalize it to a unitary probability decomposition (u.p.d.), by the use of general 
probability densities instead of the indicator functions. 



2.1 Maximum likelihood and entropy 

If data points are assumed to be distributed following a probability density which is given up to 
the values of certain parameters, the maximum likelihood gives estimators for these parameters. 
For our application, it will be more convenient to formulate the maximum likelihood method in 
terms of the maximization of an entropy. The space in which data points tu take their values is 
denoted CI, and this space may be multi-dimensional. The entropy of a probability density P on 
£1 is given by 



H(P) = 



P(tu]logP(cu) dcu 

o 



It can be used to determine the probability distributions of a system for which some information 
is available. This information could be the values of characteristics like mean and variance, but 
could also be knowledge that P — Gx, where Gx is given up to the values of the parameters 
X = (xi , X2, . . . , Xn). The shape of P, or the values of x, should be such that the entropy assumes 
its maximum value ^ 

But we are interested in a different situation, with a sample or a stream of data as the only 
information available. So the a priori probability density P is given somehow, and we do not 
know it, but want to approximate is with the help of the data, using a parametrized density Gx, 
specially chosen for this task. P defines a probability measure on O, and for a measurable 
function f we write 

(f)p:= f(tu]P(tu]dcu . 

Ja 

In search for the optimal values of x, we introduce the entropy of Gx relative to P: 

H(P;Gx):=(logGx)p . (1) 

It is the original entropy with logP replaced by logGx, and naturally, one would expect that if x 
is optimal, then the two entropies are close together. Encouraged by this expectation, we state 
that 

Principle 1 x is optimal if H(P; G^] assumes its maximum value. 



^Or such that — H(P) assumes its minimal value. 
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Extrema of H(P; G^) are given by solutions of the equations 

= — H(P;GJ = — (logG^P , i = 1,...,n . 

The connection with the maximum likelihood can be established by realizing that, in real life, 
one does not know P, so that integrals over D. have to be estimated with the help of the available 
data, which are distributed following P. In fact, if tu = (a>i , (Vz, . . . , cun) is such a sample of 
data points, and if ( )p exists, then 

lc=l 

where convergence takes place at least in probability, like in Monte Carlo integration. Using this 
estimator for the integral, the equations become 

9 9 9 

so that solutions give extrema of the likelihood function. We prefer to stick to the entropy formu- 
lation from now on, because it seems more appropriate in the case of a continuous data stream, 
a situation that resembles the one of Monte Carlo integration, where a continuous stream of data 
points is generated in order to integrate a function. 



2.2 Entropy and multi-channeling 

From now on, Gx will always be linear in the parameters x = (xi , X2, . . . , Xn), and be defined 
with the help of n probability densities g^, or channels, by 

n 

G^[w] = ^Xigi(cu) . (2) 

The parameters, or weights will always be positive, and normalized such that = 1 • If we 

look for extrema of the entropy dU), we have to take care that the weights stay normalized, and 
instead of including this normalization in Gx explicitly, we prefer to extend the entropy with the 
help of a Lagrange multiplier. So we want to find the maximum of 

n 

H(P; Gx, A) := ( log Gx )p - a(i - }^ Xt) 

i=i 

with respect to x and A. Extrema are solutions to the equations 

u 

(gi/Gx)p = A , i = 1,...,n and ^ Xi = 1 . 
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The value of the Lagrange muhiplier A can be found by muhiplying the equation ( gi/Gx ) p = A 
with Xi and taking the sum over i. Remembering @, we then find that A = 1 . 

The question is now whether this solution corresponds to a maximum, and performing an 
analysis like in ID, we can establish that it leads to at least a local maximum: denoting the 
solution by x and taking a small variation e — (£i, . . . , Cri) with 'YJl=.\ — 0, around the 
solution we find 

H(P; G^+„ 1 ) = H(P; G^, 1 ) - 1( G^/G^ )p + ©(e^) , 

and we see that small variations lead to a decrease of the entropy. We can re-formulate Principle[T] 
now, and state that if Gx is a weighted sum of channels gt, then 

Principle 2 x is optimal if the ( gi/Gx ) p are equal (to 1 ) for all i — ,n. 
2.3 Numerical path to the solution 

In general it is difficult to find an analytic solution to the problem posed by Principle |2| We 
can, however, try to find the solution, or at least an approximation, by numerical methods. We 
follow the same path as in lUl (Appendix by considering the case in which the channels gi 
are normalized indicator functions of non-overlapping subsets of D. with volume v^, so that Gx 
is a histogram. Let di = v^g^ denote the indicator functions. Then 

( gi/Gx)p = — (^i)p so that the solution is given by Xi = (-&t)p . 

X| 

So the optimal weight of channel gt is given by to the integral of the probability density over 
the subset corresponding to the indicator function. The weight for the indicator function is then 
given by Xi/vi: the height of the bin, like usually for a histogram. We observe now that, starting 
from any x, the successive operations 

Algorithm 1 (unitary probability decomposition) 

1. Ui f- Xi( gi/Gx)p foralli = 1,...,n 

2. Xi <— — for all 1 = 1 , . . . , n 

lead directly to the optimal values for x, and this gives us faith to seek for the solution in the 
general case by recursive application of these operations. 

In real life again, we cannot calculate ( gi/Gx)p, and have to estimate it with an available 
sample cu by ( gi/G^)^^. Notice that, in the case that Gx is a histogram, Xt( gi/Gx)uj = 
is the number of data points in bin i. 
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Figure 1: Histograms (upper graphs) and Cauchy-u.p.d.s (lower graphs) for random data-points, 
distributed following the dashed curve. 



Figure 2: Left: gj for n = 21 . Right: 1 0"^ data points taken in batches of less than 1 0^, with only 
one iteration of Algorithm [l] for each batch. 

2.4 An example of unitary probability decompositions 

As a small application, we show how the above may be used to construct u.p.d.s. For simplicity, 
we consider the one-dimensional case of a histogram on the interval [0, 1]. Instead of n normal- 
ized indicator functions gi(cu) = nQ{a) — — cu) of n bins with width 1 /n, we use n 
Cauchy densities 

A- 

9i(^) ■= T I ra.-cu,y With (T, = I , cut = ^ , 



and normalization 1 /A^ = arctan(^^^^) +arctan(-^). The left of FigureEldepicts g/forn = 21. 
In Figure[T]we present the histograms (upper graphs) and Cauchy-u.p.d.s (lower graphs) for 1 0^, 
1 0^ and 1 0"* random data-points, distributed following the density depicted with the dashed curve 
in all graphs. Both types consist of 21 channels, and for the generalized type the value of the 
weights after a maximum of 1 0^ iterations of Algorithm[T]is used. 
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For high number of data points, a high number of iterations becomes inappropriate, and in 
case of a data stream, the number of data points increases continuously. In those cases, batches 
of data points can be used, and Algorithm [T] can be applied once with each batch. In order to 
choose a size for the batches, we take into account the common rule that, in a normal histogram, 
every bin should contain at least a few data points in order to trust it. The number of data points 
from a sample cu = (a>i , tui, • • • , <^n] in the bin corresponding to indicator function = vtgt 
is given by 

and we can use this for the general case. Of course, ( g?/P )p can only be estimated, for example 
with ( g?/ Gx )tu- The right of Figure |2l shows the result with 1 0"* random data points, taken in 
batches. The size of the batches was such that the 'generalized number of data points' for each 
channel was at least 35, which happened to lead to batches with not more than 1 0^ data points. 



3 Multi-channeling with adaptive channels 

In the discussion so far, the channels gi were fixed, and the only adaptation appeared for the 
weights Xi. As described in the introduction, it would be attractive to also adapt the channels. 
We introduce the algorithm PARNI^ in which this is achieved. We consider the s-dimensional 
hypercube O = [0, 1]^. The channels are all normalized indicator functions, which are, however, 
not necessarily non-overlapping. The subsets corresponding to the indicator functions will only 
be boxes of the type [ai , bi] x [ai, bi] x ■ • • x [qs, bg]. 

Algorithm 2 (PARNI) 

1. We start with 2s channels corresponding to all pairs of boxes obtained by dissecting Q. in 
two along the Cartesian directions. 

In the case of the data stream, data points are generated from an external source, and in the case 
of numerical integration, they are generated from Gx, the density constructed with the channels 
and the weights. For completeness, we repeat the algorithm to generate the points in the case of 
numerical integration: 

2. choose a channel with a probability equal to the weight of the channel; 

3. generate a point in the box corresponding to the channel, uniformly distributed. 
A batch of data is collected, and 

4. depending on the task. Algorithm [T] or Algorithm l3l (Appendix IXI) is applied to optimize 
the weights. 

^Practical Adaptive Random Number Idealizer. 
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5. Directly after an optimization step, the box with the highest weight is replaced by 2s boxes, 
obtained in pairs by dissecting the original box in two along the Cartesian directions. 

This step makes the algorithm fully self-adaptive. Executed for the first time in two dimensions, 
this step could look as follows 



The 4 overlapping boxes on the l.h.s. cover the integration space twice. We just drew them in 
pairs next to each other for clarity. Suppose the upper right box has the largest weight. Then 
it is replaced^ by the smaller boxes on the r.h.s., so that we end up with 7 overlapping boxes 
which cover the integration space 2j times. The addition of boxes cannot be continued forever 
in practice, and one would like to restrain the number of channels to a maximum. 

6. If the number of channels reached its maximum, boxes with the smallest weights can be 
merged by replacing them by the smallest possible (new) box that contains all of them. 

Notice that this last procedure is very simple for Cartesian boxes. Also notice that the merging 
of boxes is necessary, and that one should not just throw away boxes, because of the danger of 
ending up with 'holes' in the integration space. Of course, the last two steps can be repeated a 
few times before gathering new data points in order to replace more channels at once and possibly 
accelerate the optimization process. 

One could ask the question why using all the overlapping boxes, and not use just one new 
pair in each step. The answer to this question is simply stated by a new question: which pair? 
The idea is to let the algorithm for the weight optimization decide which new boxes are going 
to be important. Of course, the number of new overlapping boxes in each step grows with the 
dimension of the integration space, but not drasticly, only linearly. 

Another question could be why to use boxes in the first place, and why not some other 
geometrical objects. The answer to this question is, firstly, the fact that the complexity of the 
algorithm runs the risk of exploding with the dimension of O. For example, the mininal number 
of s-simplices needed to fill [0, 1]^ is s!, while the number of boxes needed is 2. Secondly, there 
is the practical simplicity to encode the boxes. 

3.1 A simple application in two dimensions 

We present some results with probability density, or integrand, 

(0.02)2+ i^^jj^ _|_ 1 ]2 

where the normalization is not written down explicitly for convenience. The choice for this 

■'Replacements or assignments are denoted with tlie arrow pointing to the left: a <— b means 'b is put in the 
memory space of a' . 
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Figure 3: Left: P(tu) from Q. Right: Unitary probability decomposition Gx after 10^ data 
points distributed following P. 
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Results for the integration of density Q. 



particular density is taken from [4|. It is interesting because it is not factorizable, and VEGAS 
performs badly when used to integrate it. Since PARNI uses the Cartesian boxes, also here this 
density constitutes a serious test. All results with PARNI were obtained with a maximal number 
of 1 000 channels, batches of 1 000 data points for the multi-channel optimization, and with 20 
iterations of step s|5] and |6l 

The results for the integration of ^ are collected in Figure |3l Given are the result (average 
weight), the estimated error (standard deviation) and the efficiency (average weight divided by 
maximum weight). 

For the method 'integration with variance optimization', we see the effect of the optimization 
if we compare the result of the last 10^ data points with the result using all data points: the 
efficiency improves and the estimated error is less than 10 times worse, the '10' being expected 
from the 1 / -\/N-rule of Monte Carlo integration. 

For the method 'integration after u.p.d. creation', first 10^ data points distributed following 
Q were generated to create the u.p.d., and then 1 0^ integration points were generated using this 
u.p.d. to integrate Q. So observing that the result is better than in the case of variance optimiza- 
tion, one should keep in mind that twice as many data points have been used. Furthermore, this 
method of integration cannot be considered useful as long as we do not specify how to generate 
the first 1 0^ data points. We will adress this issue in Section |4l For now, the "smallness" of the 
error estimate should be interpreted as a measure of the "goodness" of the u.p.d.. 

The results of integration without any optimization are also included for comparison. Notice 
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that in that case, the efficiency including the total amount of data is actually better than in the 
case of variance optimization. Apparently, the optimization process for PARNI needs some 
trail- and-error in the beginning to end up at the results for the last 1 000 data points, which are 
obviously better than for the non-optimized case. This problem becomes more apparent in more- 
dimensional applications and can be solved following the method of 'integration after u.p.d. 
creation', as we will see in Section^ 



3.2 Application to phase space integration 



In the following, some results from the application of PARNI in phase space integration are 
presented. We consider the problem of calculating 



TV) Pn+I , 



(4) 



with 



(TL X n 

ndW(Pi-m?)e(p°))5(Q-}^p,) , (5) 
1=1 ' i=i 



and integrand 



A(po,Pl,...,Pn,Pn+i; 



(PO- Pi )(Pl -PlllPl-Psj-'-lPn-Pn+ljlPn+l " Po) 



(6) 



[j,0,0,^) and pn+i 



where (pt ■ Pj) denotes the Lorentz invariant scalar product, po 

0, 0, — -j). The problem in calculating this integral is the pole structure in the scalar products 
of the integrand. Although the integrand is regularized by a cut-off Sc in these scalar products, 
it still has a peak structure that makes convergence in a straightforward numerical calculation 
problematic. Integrals with this type of singularity structures are typically found in the phase 
space integration of QCD amplitudes, and they are called antenna pole structures [|2j|6J. 

The straightforward numerical (Monte Carlo) calculation would consist of the generation of 
TL random momenta pi uniformly distributed in phase space, the bounded (3n — 4) -dimensional 
subspace of R"^^ encoded in dO, and the calculation of the average of the integrand. Each set 
of random momenta is constructed from a set of random numbers between and 1 , and the idea 
is now to let PARNI deliver these numbers. For the construction of the momenta, there are two 
algorithms on the market: RAMBO, which uses the so-called democratic approach and the 
hierarchical construction of momenta (HI COM). The disadvantage of RAMBO for our application 
is that is needs 3n instead of Sn — 4 random numbers per n momenta, and we want to keep the 
dimension as low as possible"^. Furthermore, the freedom one has in the actual implementation of 
HI COM allows for an algorithm that is completely equivalent with RAMBO in the case of massless 
momenta. This implementation is presented in Appendix iBl 

"^The original code by R. Kleiss uses 4rL random numbers per n momenta, but this can easily be reduced to 3n 
with Uttle extra cost. 
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graph 4 



2 2.5 3 3.5 4 4.5 5 2 2.5 3 3.5 4 4.5 

Figure 4: The process of convergence during the Monte Carlo calculation of (|4l) for n = 4 
massless momenta with center of mass enery a/Q^ = 1 000 GeV and = 450 GeV^. Along 
the horizontal axis runs ^°log(# events). Two curves of the same type give the average plus the 
standard deviation and the average minus the standard deviation. 
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Figure |4| shows the results for the case of n = 4 massless momenta with a center of mass 
energy a/Q^ = lOOOGeV and a cut-off Sc = 450 GeV^. There, the process of convergence 
during the Monte Carlo integration is shown. Along the horizontal axis runs ^°log N, where N 
is the number of generated events (sets of momenta). In each graph, the two curves of the same 
type give the average plus the standard deviation and the average minus the standard deviation 
after the number of events on the horizontal axis. 

Graph 3 shows the result with phase space generators HAAG and SARGE, which were spe- 
cially designed to integrate functions with antenna pole structures EE!- This is how a decent 
Monte Carlo integration process should look like. The standard deviations converge to zero and 
the average moves around a straight line: the process is unbiased and the estimated error can be 
trusted at any point. 

SARGE and HAAG were developed to cure graph 1, the result with HI COM and RAMBO. The 
standard deviations hardly converge and the estimated error cannot be trusted at all: as long as 
no event hits a peak, the average is an under estimation. If an event hits a peak, the standard 
deviation increases drastically. Notice that the horizontal axis runs over 1 00 times more events 
than in graph 3. Because of the particular peculiar behavior of HI COM in this run, graph 2 with 
the result of the integration of the square root of the integrand, which has a much softer singular 
behavior, is included. We see a decent Monte Carlo process again and see that HI COM behaves 
the same as RAMBO. 

Graph 4, finally, depicts the results with HI COM in combination with PARNI. For the curve 
'during optimization', PARNI started from scratch. We see the peculiar behavior of HICOM 
back in this run, for which PARNI tries to compensate. PARNI used a maximal number of 1 000 
channels, batches of 1 000 data points for the multi-channel optimization with Algorithmic and 
4 iterations of steps |5] and |6l After the generation of 10^ events, the channels of PARNI were 
stored, and the process was repeated starting with these channels, leading to the curves 'after 
optimization', which show a decent Monte Carlo process again. 

4 The use of u.p.d.s in numerical integration 

The behavior of PARNI during optimization in Figured is what one would naturally expect 
from a general-purpose self adaptive Monte Carlo integration systems. A general-purpose sys- 
tem starts with no information about the integrand, and cannot do anything else then start with 
generating integration points distributed uniformly over the integration space. If the integrand 
has sharp peaks, there is no enhanced probability for the system to hit these peaks in the begin- 
ning, and an under estimation of the integral is the result. One can imagine the extreme case of 
an integrand consisting of a sum of delta-peaks, for which the probability to see them is zero. 

A solution to this problem is given by the possibility to create a u.p.d. before starting with the 
integration. This u.p.d. can then serve as a starting point for the self adaptive integration system. 
In order to create the u.p.d., one needs a stream of data points that are distributed following 
a density that looks like the integrand. This stream of data will immediately show the peak 
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structure, and in the extreme case mentioned before it will only show the delta-peaks. Important 
in this idea is that we realize that the creation of such a stream is, in principle, an easier problem 
than the integration problem: one can, for example, use the Metropolis algorithm. 

Since we want to restrict ourselves to integration problems in phase space like in the previ- 
ous section, we refer to [8| for the use of the Metropolis algorithm. The disadvantage of this 
algorithm for the use of integration is that, although it creates a stream of data points that are dis- 
tributed following any normalized positive function on the integration space, it does not give this 
normalization, the determination of which usually is the actual integration problem one wants to 
solve. 

In the application we just described this normalization is not needed. Having in mind, how- 
ever, that the evaluation of the integrand is expensive, we want to use all these evaluated integra- 
tion points as efficiently as possible. Let us denote 

, ^ r„f(cu] P(a;]dcu , ^ f 
(f)p:- ^^ , (P):=J^P(cu]da; 

for any integrable positive bounded function P on the integration space O. With the Metropolis 
algorithm, one can estimate ( f ) p for any quadratically integrable f , but one cannot calculate 
( P ) . Inspired by the solution to this problem proposed in JSJ, we observe that 

(P) = (pi-«)p„(p-) 

for any <x G [0, 1]. We also observe that the function P'^ will be easier to integrate than P 
itself because the peaks are suppressed. There is an equilibrium: for small a the calculation of 
( P*) will be easy, but the estimation of ( P^^*)pa will be difficult, whereas for oc close to 1 the 
estimation of ( P^^")pa will be easy, but the estimation of ( P") difficult. Important in light of 
the foregoing is that the estimation of ( P^^")pa will serve a stream of data points distributed 
following P* which can be used to create a u.p.d. for the estimation of (P"). This sounds 
tricky, like one uses integration points twice, but this is not the case. The integration points used 
to estimate ( P^^")pa are used to initialize an integration system to calculate ( P'^) with new 
integration points. 

The question that remains is what value to choose for <x. We shall stick to values close to 
1 , so that the estimation of ( P^^'^)pa is relatively easy, and the problem of calculating ( P*) is 
close to the original integration problem. 

4.1 Application to phase space integration 

It appears that PARNI performs better with a small change in the algorithm, namely by merging a 
number of boxes after every optimization step, so by performing step|6leven before the maximum 
allowed number of channels has been reached. The positive effect must be a result of the gain in 
flexibility. For the following calculations, a maximum number of 4000 channels has been used, 
with an optimization step after every 4000 events. During every such step, more-or-less 50 boxes 
were merged and step |5l was iterated as many times as it takes to add more-or-less 1 20 boxes. 
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Another advantage of the use of HI COM instead of RAMBO in combination with PARNI 
shows up for the use of the hybrid Metropolis procedure sketched before, namely the fact that 
the mapping from random points in the (3n— 4) -dimensional hypercube to momenta is invertible. 
This means that, for the initialization of PARNI through the creation of the u.p.d., the momenta 
generated with the Metropolis algorithm as described in lO can be used. One could, of course, 
use another implementation of the Metropolis algorithm in the (Sn — 4) -dimensional hypercube 
and map these points to momenta, but we do not want to go into this issue at this stage, not in the 
least place because of the (extreme) simplicity of the algorithm from [jSJ. 

We apply all this to the problem of calculating Q, with the function A replaced by a sum 
over permutations of its arguments, so that the integrand looks more like a Q CD-amplitude: so 
we want to calculate 



dOn(Q;m^, . . . , m^;pi, . . . ,Pn) Asym(po,Pi, • ■ . ,Pn,Pn+i; 



(7) 



with 



'^sym(PO) Pi ! • • • ) PtV) Pn+1 ) — P7t(1)) • • • ) PttIti)) P7T(n+l)J ) 

neSym(^x+^) 

and with dOn, A as in ©, ©. Below we put the integration process for the calculation of 
( Ajynf ) A^a^ and the initialization process of PARNI in a flow chart. The variables are the set p 
of n momenta and cu G [0, 1]^"^^"^. 



Metropolis with A^ 



a 

sym 



p ^ 


HICOM"^ 




PARNI 



ip 



integrand A^^^* 



sum (— sum + Al^^{-p) 



In the chart for the integration process of ( A^^j^ ) and further optimization of PARNI, also the 
weight factors coming from PARNI and HICOM are included: 



PARNI 



UJ 



HICOM 



integrand A^^ 



sym 



sym 



(P) 



WpWhA" 



sym 



(P) 



sum <— sum + WpWaA^^^lip) 



The results are presented in Figure|51for n = 4, and in Figure|^for n = 6. The parameter a was 
put to 0.9 for all cases. 

We start the discussion with the results for n = 4. In order to obtain the curves with the 
title 'Metropolis' in graph 1, the calculation of (A^^) was done first with SARGE, so that 
(Asym) — { {^^ym)^lyi^)A^y,^ could bc Calculated directly with the metropolis algorithm. In 
a realistic calculation this is not necessary: this has only been done in order to arrive at curves 
at the same scale as the curves from HAAG and graph 2. The Metropolis-curves converge rather 
quickly, as expected since the weights do not fluctuate much because of the suppression from the 
exponentiation with 1 — a. 
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Graph 2 depicts the process of convergence in the direct calculation of ( Asym ) with PARNI 
and RAMBO. The latter under estimates the integral for many events and hardly converges. 
PARNI starts converging quickly after 1 0^ events, obviously after the optimization took place. 

The fact that the estimated errors with RAMBO in graph 2 cannot be trusted becomes clear 
from the curves in graph 3 for the calculation of ( A^^^j^), which is supposed to be (slightly) 
easier for RAMBO, while the curves look worse. Graph 4 depicts the curves from PARNI for the 
calculation of ( A^y^ ), both with and without initialization using the data from the metropolis 
calculation of ( ) a^*^ . And clearly the curves with initialization converge faster. 

For n = 6, the graphs look more-or-less the same: the curves from RAMBO hardly converge 
or constitute an under estimation, while the curves from PARNI do converge. Furthermore, the 
curves with initialization reach a given estimated error sooner than without initialization. 

The tables coming with the figures speak for themselves, except maybe of the last two 
columns. These give a measure of the computation time, "normalized" with respect to the 
reached error. The first of the last two columns gives this number for the actual calculation, in 
units of the computation time of the integrand. The last column gives this number extrapolated to 
the case that the evaluation of the integrand would be much more expensive than the generation 
of events. For RAMBO, which can be considered to be a very cheap algorithm, these numbers are 
close to each other^. The conclusion that can be drawn from the tables is that PARNI performs 
much better than RAMBO, and that PARNI performs better with initialization than without. The 
reason why HAAG and SARGE perform much better than the rest is that these algorithms have 
been specially designed to integrate exactly Asym- 



^The odd situation with Metropolis for n = 6, where the first number is smaller than the second, is a result of 
the fact that a data point is re-used if a new one is not accepted, and does not add to the computation time. 
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Figure 5: The process of convergence during the Monte Carlo calculation of Q for n = 4 
massless momenta with center of mass enery a/Q^ = 1 000 GeV and = 450 GeV^. Along 
the horizontal axis runs ^°log(# events). Two curves of the same type give the average plus the 
standard deviation and the average minus the standard deviation. 
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The final results corresponding with Figure |5] I is the integral, a the standard deviation, N the 
number of generated events, Ttot the total computation time. Tint the time it takes to perform one 
evaluation of the integrand and Nacc the number of accepted (non-zero weight) events. 
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Figure 6: The process of convergence during the Monte Carlo calculation of Q for n = 6 
massless momenta with center of mass energy \/Q^ — 1 000 GeV and Sc = 450 GeV^. Along 
the horizontal axis runs ^°log(# events). Two curves of the same type give the average plus the 
standard deviation and the average minus the standard deviation. 
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The final results corresponding with Figure I is the integral, a the standard deviation, N the 
number of generated events, Ttot the total computation time. Tint the time it takes to perform one 
evaluation of the integrand and Nacc the number of accepted (non-zero weight) events. 
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5 Summary 



A histogram has been shown to be a special case of the application of the multi-channel method 
to derive a probability distribution from a sample, or a stream, of data. The channels are the 
normalized bins of the histogram, and the channel weights are the volumes of the bins. An 
algorithm has been presented how to optimize the weights in the case of channels that consist of 
arbitrary probability densities, leading to a unitary probability decomposition (u.p.d.). 

Furthermore, the algorithm PARNI has been presented, for which not only the channel 
weights, but also the channels themselves are adaptive. It can be used both for the creation of 
u.p.d. s and for automatic importance sampling in Monte Carlo integration. It handles data in an 
s-dimensional hypercube [0, 1]^ for arbitrary s. It has no a priori problems with non-factorizable 
peak structures in the integrand, and its complexity grows linearly with s. 

Finally, it has been shown how the property of PARNI to create a u.p.d. can be used to 
optimize the Monte Carlo integration procedure, and this has been applied to a problem in phase 
space integration. 
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6 Appendices 

A Multi-channeling for importance sampling 

In lUl, the multi-channel method was constructed such that the variance of the Monte Carlo 
estimator of the integral of a function is minimal. We briefly repeat the line of argument. A 
sample of data points cu = (tui , Wj, ■ . ■ , cun) is generated distributed following the density Gx, 
and the integral of integrand P over O is estimated by 



N 



Pfcul dw 



The variance of the estimator is given by 

P(cu]2 



1 



a 



dw 



Pfcu] dw 



(8) 



and extremization leads to the solution that the quantities 

g,[w]P{w)^ 



dcu 



a 



have to be equal for all i = 1 , . . . , n. If Gx is a histogram, then the solution is immediately found 
using Algorithm[T]with the first step replaced by 

Algorithm 3 (importance sampling by variance optimization) 

xVWi(P, Gx) for aU i = 1 , 



1. Vi 



n 



which is also applied in the general case. Of course the Wi(P, Gx) cannot be calculated exactly, 
but can be estimated by ( QiP^/Gx )uj' since cu is distributed following Gx- 
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B HICOM 

In the Hierarchical Construction Of Momenta, one uses the fact that the phase space can be 
decomposed as 

d(pT^(P; 0-1, ... , ffn'.Pl . . . ,Pn) = dSn-l dCD2(P; ffrv, Sn-liPn, Qn-l) 

X dSn-2 d02(Qn-i; ^n-l , Sn-l! Pn-1 , Qn-z) 

: (9) 

X dS2d(D2(Q3;CT3,S2;P3, Q2) d02(Q2;ff2, o-i;p2,Pi) , 

where 

dOjiQ; su S2; qi , qi) := 8[q^, - s, )e(q°) d\2Hc\j - S2)e(q°) 8\Q - q, - ^2) 

is the standard two-body phase space. This decomposition tells us that if the variables involved 
are generated in this order and with these dependencies, then the final momenta are distributed on 
the desired bounded (Sn — 4) -dimensional subspace of R"^^. It does not tell us how the momenta 
are distributed, and there is the freedom how to generate the variables in each of the two-body 
phase spaces, and the variables St. Examples of particular choices to obtain momenta that are 
distributed following the antenna pole structure can be found in [(2j|. 

A well-known example how to generate each of the two-body phase spaces is by generating 
an angle cp uniformly in [0, 2n], and a variable z distributed uniformly in [— 1 , 1 ], performing the 
construction 

Iqil^ VA(Q^si,S2)/W 
q? ^ \/si + IqiP 

q 1 I q 1 1 ( \/l — cos (p , a/I — sin (p , z ) , 

boosting q 1 to the center-of-mass frame of Q, and putting q2 = Q — q 1 . The symbol A stands for 
the standard Kallen function. This construction gives a Jacobian factor 2Q^/n/ ^J'KIQ^, Si , $2] 
in the density. If we look at the decomposition ® in the case that all squared masses ct^ are zero, 
we see that the Jacobian factors are equal to 2Q?/7r/(Q? — St_i). Since Q? = Si in the end, 
we see that the non-constant parts of the Jacobian factors cancel if the variables Si are generated 
following the density 

.,■ 1^ (Si+l - Si)(Si)'"^ 



leading to momenta that are uniformly distributed over phase space with constant density 
(n-1)(n-2)2P^ (n-2)(n-3)2 (2)(1)2 2 _ ( r(n)r(n- 1) 

(P2]n-1 7^ --y-J (p2)n-2 

The density for the variable Si is obtained by generating p G [0, 1] following the Z^eto-density 
|3i-i,2(p] =i(i-1)(1 - p)p'"^ and putting Si <- Si+i p. 
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Cheng's BA algorithm f9l is very efficient in generating random numbers following beta- 
distributions. However, it uses the method of rejection and needs, on the average, more than one 
uniformly distributed random number for returning one Z?eto-variable. What we would like is a 
direct construction with one random number as input, and the beta-variable as output, so that we 
can let PARNI deliver the input. Fortunately, Cheng's BA algorithm is so efficient that we can 
skip the rejection part, and just use its construction of a variable that is almost a beta-vanahle. If 
necessary, PARNI will compensate for that. For a general beta-density 



(3a.b(p)0c(1-p)^-V" 



The density for the almost-Z7eto-variable is given by 



d 

dp (a/b)^(1 - p) 



with 




v/(2ab- a-b)/(a + b-2) ifmin(a,b)>l 



(a, b) if min(a, b) < 1 
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